feat(validator): add support to validate essential metrics produced by Kepler #1834

Closed

vprashar2929
Collaborator

@vprashar2929 vprashar2929 commented Nov 4, 2024

This commit introduces functionality to validate essential metrics produced by Kepler.
The following comparisons are included:

  • Node Exporter Comparison

    • Validates node_rapl_<package|core|dram> metrics against kepler_node_<package|core|dram>{dev}
  • Kepler Process Comparison

    • Compares kepler_process_<package|core|dram|platform|other|uncore>{latest} metrics to
      kepler_process_<package|core|dram|platform|other|uncore>{dev}
  • Kepler Node Comparison

    • Validates kepler_node_<package|core|dram|platform|other|uncore>{latest} against
      kepler_node_<package|core|dram|platform|other|uncore>{dev}
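
A minimal sketch of how one such comparison could be scored, assuming each side has already been reduced to a list of power samples taken at the same timestamps. The function names, the sample values, and the 5% threshold are illustrative assumptions and are not taken from the validator itself.

```python
from typing import Sequence


def mean_absolute_error(expected: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between two equally long sample series."""
    if not expected or len(expected) != len(actual):
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(e - a) for e, a in zip(expected, actual)) / len(expected)


def relative_error_percent(expected: Sequence[float], actual: Sequence[float]) -> float:
    """MAE expressed as a percentage of the mean expected value."""
    mae = mean_absolute_error(expected, actual)  # also validates the inputs
    mean_expected = sum(expected) / len(expected)
    if mean_expected == 0:
        raise ValueError("mean of expected series is zero")
    return 100.0 * mae / mean_expected


def within_threshold(
    expected: Sequence[float], actual: Sequence[float], max_percent: float
) -> bool:
    """True when the relative error does not exceed the allowed percentage."""
    return relative_error_percent(expected, actual) <= max_percent


# Made-up power samples (watts), for illustration only.
latest = [10.0, 12.0, 11.0, 13.0]
dev = [10.5, 11.5, 11.0, 13.5]

print(mean_absolute_error(latest, dev))    # 0.375
print(within_threshold(latest, dev, 5.0))  # True
```

Expressing the error relative to the mean of the reference series keeps one threshold meaningful across metrics of very different magnitude, such as package and DRAM power.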

Additionally, the following changes are made to existing functionality:

  • Adds a new metric_validations.yaml file, which contains the PromQL queries for the comparisons along with their threshold values
  • Updates the existing stressor.sh script to support a few more parameters, making it more flexible:
    • warmup time: time to wait before starting the stressor
    • cooldown time: time to wait after the stressor has finished
    • repeats: number of times to repeat the stressor; this is configurable because
      the regression test does not need to repeat the stressor multiple times
  • Adds a new validator-regression.yaml file, which contains the configuration for the regression test
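
The ordering implied by the three stressor parameters can be sketched as follows. This is only an illustration in Python, not the `stressor.sh` implementation: `run_stressor`, its arguments, and the choice to apply warmup and cooldown once around the whole run (rather than once per repeat) are all assumptions.

```python
import time
from typing import Callable


def run_stressor(
    stress: Callable[[], None],
    warmup: float,
    cooldown: float,
    repeats: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Wait `warmup` seconds, run `stress` `repeats` times, then wait `cooldown` seconds.

    `sleep` is injectable so the ordering can be checked without real delays.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    sleep(warmup)          # let the system settle before applying load
    for _ in range(repeats):
        stress()           # hypothetical callable that applies the load
    sleep(cooldown)        # let metrics settle after the load ends
```

With `repeats=1`, which is the case described for the regression test, the load is applied exactly once between the two waits.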

dependabot bot and others added 30 commits August 15, 2024 12:51
…1 updates

Bumps the go-dependencies group with 8 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [github.com/beevik/etree](https://github.com/beevik/etree) | `1.4.0` | `1.4.1` |
| [github.com/cilium/ebpf](https://github.com/cilium/ebpf) | `0.15.0` | `0.16.0` |
| [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) | `2.19.1` | `2.20.0` |
| [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) | `1.19.1` | `1.20.0` |
| [github.com/prometheus/prometheus](https://github.com/prometheus/prometheus) | `0.53.1` | `0.54.0` |
| [golang.org/x/time](https://github.com/golang/time) | `0.5.0` | `0.6.0` |
| [k8s.io/api](https://github.com/kubernetes/api) | `0.29.7` | `0.29.8` |
| [k8s.io/client-go](https://github.com/kubernetes/client-go) | `0.29.7` | `0.29.8` |



Updates `github.com/beevik/etree` from 1.4.0 to 1.4.1
- [Release notes](https://github.com/beevik/etree/releases)
- [Changelog](https://github.com/beevik/etree/blob/main/RELEASE_NOTES.md)
- [Commits](beevik/etree@v1.4.0...v1.4.1)

Updates `github.com/cilium/ebpf` from 0.15.0 to 0.16.0
- [Release notes](https://github.com/cilium/ebpf/releases)
- [Commits](cilium/ebpf@v0.15.0...v0.16.0)

Updates `github.com/onsi/ginkgo/v2` from 2.19.1 to 2.20.0
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.19.1...v2.20.0)

Updates `github.com/prometheus/client_golang` from 1.19.1 to 1.20.0
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](prometheus/client_golang@v1.19.1...v1.20.0)

Updates `github.com/prometheus/prometheus` from 0.53.1 to 0.54.0
- [Release notes](https://github.com/prometheus/prometheus/releases)
- [Changelog](https://github.com/prometheus/prometheus/blob/main/CHANGELOG.md)
- [Commits](prometheus/prometheus@v0.53.1...v0.54.0)

Updates `golang.org/x/sys` from 0.22.0 to 0.23.0
- [Commits](golang/sys@v0.22.0...v0.23.0)

Updates `golang.org/x/time` from 0.5.0 to 0.6.0
- [Commits](golang/time@v0.5.0...v0.6.0)

Updates `k8s.io/api` from 0.29.7 to 0.29.8
- [Commits](kubernetes/api@v0.29.7...v0.29.8)

Updates `k8s.io/apimachinery` from 0.29.7 to 0.29.8
- [Commits](kubernetes/apimachinery@v0.29.7...v0.29.8)

Updates `k8s.io/client-go` from 0.29.7 to 0.29.8
- [Changelog](https://github.com/kubernetes/client-go/blob/master/CHANGELOG.md)
- [Commits](kubernetes/client-go@v0.29.7...v0.29.8)

Updates `k8s.io/klog/v2` from 2.120.1 to 2.130.1
- [Release notes](https://github.com/kubernetes/klog/releases)
- [Changelog](https://github.com/kubernetes/klog/blob/main/RELEASE.md)
- [Commits](kubernetes/klog@v2.120.1...v2.130.1)

---
updated-dependencies:
- dependency-name: github.com/beevik/etree
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: github.com/cilium/ebpf
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/prometheus/prometheus
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: golang.org/x/sys
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: golang.org/x/time
  dependency-type: indirect
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: k8s.io/api
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: k8s.io/apimachinery
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: k8s.io/client-go
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: k8s.io/klog/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
...

Signed-off-by: dependabot[bot] <[email protected]>
…puting-io/dependabot/go_modules/go-dependencies-bb1f50d887

build(deps): bump the go-dependencies group across 1 directory with 11 updates
…updates

Bumps the github-actions group with 5 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `3` | `4` |
| [anchore/sbom-action](https://github.com/anchore/sbom-action) | `0.16.1` | `0.17.1` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4.3.4` | `4.3.6` |
| [actions/setup-python](https://github.com/actions/setup-python) | `3` | `5` |
| [ossf/scorecard-action](https://github.com/ossf/scorecard-action) | `2.3.3` | `2.4.0` |



Updates `actions/checkout` from 3 to 4
- [Release notes](https://github.com/actions/checkout/releases)
- [Commits](actions/checkout@v3...v4)

Updates `anchore/sbom-action` from 0.16.1 to 0.17.1
- [Release notes](https://github.com/anchore/sbom-action/releases)
- [Commits](anchore/sbom-action@v0.16.1...v0.17.1)

Updates `actions/upload-artifact` from 4.3.4 to 4.3.6
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4.3.4...v4.3.6)

Updates `actions/setup-python` from 3 to 5
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v3...v5)

Updates `ossf/scorecard-action` from 2.3.3 to 2.4.0
- [Release notes](https://github.com/ossf/scorecard-action/releases)
- [Changelog](https://github.com/ossf/scorecard-action/blob/main/RELEASE.md)
- [Commits](ossf/scorecard-action@dc50aa9...62b2cac)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: anchore/sbom-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: github-actions
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: github-actions
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: ossf/scorecard-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <[email protected]>
…puting-io/dependabot/github_actions/github-actions-5a7b011f50

build(deps): bump the github-actions group across 1 directory with 5 updates
…server-patch-1

feat: add model_name attribute to ComponentModelWeights
Signed-off-by: Sunyanan Choochotkaew <[email protected]>
…r-longer-test

chore(validator): run stress test for longer
This commit introduces a workflow for testing ACPI functionality
using Equinix self-hosted runners. The workflow deploys Kepler using
mock-acpi compose setup and runs validator to ensure functionality.

Key-features:
- Workflow is triggered on pull requests that include a specific commit
  message `/test-acpi`.
- Environment setup is handled by ansible.

Signed-off-by: Vibhu Prashar <[email protected]>
…puting-io/add-acpi-wkf

feat(ci): implement mock-ACPI workflow
…server-patch-1

fix: format ComponentModelWeights
This commit moves model_weights from /var/lib/kepler/data/ to its own
directory, /var/lib/kepler/data/model_weights/. This allows additional data
such as machine-spec to be stored in its own directory.

Additionally this change fixes the blank cpu.yaml that gets created when
running compose files.

Signed-off-by: Sunil Thaha <[email protected]>
…k-cpu-yaml

chore: move model_weights to its own directory
This commit resolves two key issues with the mock-acpi workflow:
- Check out the correct branch: The workflow previously checked out the
  default branch when triggered by a pull request. This fix ensures
  that the correct pull request branch is checked out during the CI run.
- Attach workflow to pull request checks: The workflow was not
  being reflected under pull request checks. With this fix, the workflow
  is correctly attached, so its status is visible and reported
  under pull request checks.

Signed-off-by: Vibhu Prashar <[email protected]>
…eanup-exporter-globals

chore: cleanup globals in exporter
…server-patch-1

fix: set default trainer only for local regressor
…pi-wk-status

fix(ci): ensure proper status reporting for mock-acpi workflow
This commit addresses an issue where the `cleanup` and `final status`
jobs were incorrectly dependent on the `create-runner` job, leading to
their premature execution. This fix ensures that the `cleanup` and
`final status` jobs both run independently and only after all necessary
preceding jobs have finished.

Signed-off-by: Vibhu Prashar <[email protected]>
Signed-off-by: Sunyanan Choochotkaew <[email protected]>
…server-patch-1

feat: add --disable-power-meter option
…x-job-flow

fix(ci): ensure independent execution of cleanup and status jobs
…idation

feat: Export validation result as json object.
This commit addresses the issue of multiple jobs defined in the
`mock-acpi` workflow, which were unintentionally executing in parallel
due to sequence constraints. By consolidating the workflow into a single
job, we ensure that the tasks are executed sequentially

Signed-off-by: Vibhu Prashar <[email protected]>
…puting-io/fix-flow

fix(ci): consolidate mock-acpi workflow into single job
vprashar2929 and others added 12 commits December 5, 2024 10:17
This commit migrates the mock-acpi workflow to use the
GitHub runner instead of the Equinix self-hosted runner.
Since the workflow is designed for testing ACPI functionality,
using a mock, a self-hosted runner is not required.

Running the workflow on the GitHub runner, which operates as a VM,
enables execution on every pull request, ensuring consistent validation
of ACPI functionality for Kepler.

Signed-off-by: vprashar2929 <[email protected]>
…pi-wkf

chore(ci): migrate mock-acpi workflow to GH runner
* [test]: add test case on package cgroup

Signed-off-by: Sam Yuan <[email protected]>

* [fix]: update cache setting logic when error happen

Signed-off-by: Sam Yuan <[email protected]>

* [fix]: fix lint

Signed-off-by: Sam Yuan <[email protected]>

---------

Signed-off-by: Sam Yuan <[email protected]>
…om-scaph

feat(compose): add fallback scrape protocol for Scaphandre service
Signed-off-by: Mario Vazquez <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
…race

feat(sensor): support NVIDIA Grace Hopper
…e-computing-io#1889)

Bumps the github-actions group with 4 updates: [actions/checkout](https://github.com/actions/checkout), [actions/attest-build-provenance](https://github.com/actions/attest-build-provenance), [actions/attest-sbom](https://github.com/actions/attest-sbom) and [codecov/codecov-action](https://github.com/codecov/codecov-action).


Updates `actions/checkout` from 3 to 4
- [Release notes](https://github.com/actions/checkout/releases)
- [Commits](actions/checkout@v3...v4)

Updates `actions/attest-build-provenance` from 1 to 2
- [Release notes](https://github.com/actions/attest-build-provenance/releases)
- [Changelog](https://github.com/actions/attest-build-provenance/blob/main/RELEASE.md)
- [Commits](actions/attest-build-provenance@v1...v2)

Updates `actions/attest-sbom` from 1 to 2
- [Release notes](https://github.com/actions/attest-sbom/releases)
- [Changelog](https://github.com/actions/attest-sbom/blob/main/RELEASE.md)
- [Commits](actions/attest-sbom@v1...v2)

Updates `codecov/codecov-action` from 5.0.7 to 5.1.1
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v5.0.7...v5.1.1)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: actions/attest-build-provenance
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: actions/attest-sbom
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ble-computing-io#1892)

Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.26.0 to 0.31.0.
- [Commits](golang/crypto@v0.26.0...v0.31.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
metal: metal # Job name for metal metrics, default is metal

url: http://localhost:9090 # Prometheus server URL
rate_interval: 60s # Rate interval for Promql, default is 20s, typically 4 x $scrape_interval
Collaborator Author


Explicitly setting the rate interval to 60s because:

Prometheus scrape interval = 3s
Data points in a 12s interval (i.e. 4 × scrape interval) = 12/3 = 4 data points
Data points in a 60s interval = 60/3 = 20 data points

With 20 data points, we get a smoother and more reliable estimate. When comparing two sum(rate(...)) expressions, a stable rate reduces the variability in the MAE calculation, leading to more accurate assessments.
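
The arithmetic above can be reproduced with a short sketch. The helper name `samples_per_window` is hypothetical, and the sketch assumes scrapes arrive exactly on schedule with none missed.

```python
def samples_per_window(rate_interval_s: int, scrape_interval_s: int) -> int:
    """Number of scrape intervals that fit in one rate window."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    return rate_interval_s // scrape_interval_s


SCRAPE_INTERVAL_S = 3

for window_s in (12, 60):
    count = samples_per_window(window_s, SCRAPE_INTERVAL_S)
    print(f"{window_s}s window: {count} data points")
# 12s window: 4 data points
# 60s window: 20 data points
```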

@vprashar2929 force-pushed the add-kep-reg branch 2 times, most recently from 28889fe to fe32d16, on December 16, 2024 13:06
@@ -1,5 +1,5 @@
 global:
-  scrape_interval: 5s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
+  scrape_interval: 3s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
Collaborator


Check why the scrape interval was changed, and update the comment accordingly.

Collaborator Author


Scraping every 3 seconds rather than every 5 seconds collects more data points over a given time window (for example, 20 instead of 12 in a 60s window).

@vprashar2929
Collaborator Author

For reference, here is a sample CI run showing what this would look like once merged: https://github.com/sustainable-computing-io/kepler-metal-ci/actions/runs/12366281744/job/34512777104

My idea is to use the Equinix runners on demand on PRs. Reviewers or authors can add a comment on the PR, such as /test-regression, which will trigger a workflow like this one to test whether the metrics produced by the Kepler built from the PR code base deviate from those already produced by latest.

…y Kepler

This commit introduces functionality to validate essential metrics produced by Kepler.
The following comparisons are included:

- Node Exporter Comparison
   - Validates `node_rapl_<package|core|dram>` metrics against `kepler_node_<package|core|dram>{dev}`

- Kepler Process Comparison
   - Compares `kepler_process_<package|core|dram|platform|other|uncore>{latest}` metrics to
      `kepler_process_<package|core|dram|platform|other|uncore>{dev}`

- Kepler Node Comparison
   - Validates `kepler_node_<package|core|dram|platform|other|uncore>{latest}` against
      `kepler_node_<package|core|dram|platform|other|uncore>{dev}`

Additionally, the following changes are made to existing functionality:

- Adds a new `metric_validations.yaml` file, which contains the PromQL queries for the comparisons along with their threshold values
- Updates the existing `stressor.sh` script to support a few more parameters, making it more flexible:
  - warmup time: time to wait before starting the stressor
  - cooldown time: time to wait after the stressor has finished
  - repeats: number of times to repeat the stressor; this is configurable because
    the regression test does not need to repeat the stressor multiple times
- Adds a new `validator-regression.yaml` file, which contains the configuration for the regression test

Signed-off-by: vprashar2929 <[email protected]>
@sthaha
Collaborator

sthaha commented Apr 16, 2025

@vprashar2929 can we bring this into reboot? Perhaps the validator needs a bit of a rewrite to test only bare metal.
